Ruminations about the Introduction to Open Data Science (IODS) 2021
The course Introduction to Open Data Science (IODS) feels like a good introduction to the R-RStudio-Git pipeline, advocating for reproducible, well-documented research.
Challenges may arise due to
I hope the course will generate mindsets favoring open research equipped with technical skills to produce good research output.
I hope to learn not only technical skills but also ways to endorse open data science.
I found the course in Sisu. Apparently it is also available on the MOOC platform.
date()
## [1] "Sat Nov 20 12:56:39 2021"
This week, we are analyzing and modeling survey data concerning approaches to learning. A description of the data can be found here and here.
The data has been preprocessed to include variables on gender, age, attitude, exam points, and scores on deep, surface and strategic questions.
Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.
# read in the data
learning2014 <- read.csv("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt", header = TRUE, sep = ",")
dim(learning2014)
## [1] 166 7
There are 166 rows and 7 columns.
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The variables are mostly numeric. Gender is stored as character; convert it to a factor.
# change to factor
learning2014$gender <- as.factor(learning2014$gender)
summary(learning2014)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
It seems that most of the students are female, age spans 17-55, exam points range from 7 to 33, and the question variables are capped at 5 (per the documentation, they are on a scale of 1-5).
Using GGally::ggpairs, we can plot the distributions of the variables, correlations (with significance), boxplots and scatterplots, stratified by gender.
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
p <- ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 3)))
p
We see notable correlations between

- points and attitude
- surf and deep questions
- surf and attitude
- surf and stra questions

For pedagogical reasons, let's fit a simple multivariable linear regression first with four (instead of three; choosing three would not change the results) explanatory variables, i.e. regress exam points on the other variables. Choose the variables based on their absolute correlation with the exam points.
sort(abs(cor(learning2014[, -1])[, c("points")]))
## deep age surf stra attitude points
## 0.01014849 0.09319032 0.14435642 0.14612247 0.43652453 1.00000000
model <- lm(points ~ attitude + stra + surf + age, data = learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.8639 -3.2849 0.3086 3.4548 10.7028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.70644 3.96234 3.459 0.000694 ***
## attitude 3.38964 0.57041 5.942 1.68e-08 ***
## stra 0.93123 0.53986 1.725 0.086459 .
## surf -0.76565 0.80258 -0.954 0.341524
## age -0.09466 0.05346 -1.771 0.078502 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.262 on 161 degrees of freedom
## Multiple R-squared: 0.2226, Adjusted R-squared: 0.2033
## F-statistic: 11.52 on 4 and 161 DF, p-value: 2.987e-08
From the summary, the intercept is significantly non-zero based on a t-test (H_0: the intercept is zero, H_1: it is not), although the result is not very meaningful here as e.g. age cannot be zero in this context.
The attitude variable has a highly significant effect on the exam points (the p-value is essentially zero for the t-test of H_0: the coefficient of attitude is zero when the other variables are held constant, against H_1: it is non-zero). Variables stra and age are weakly significant or non-significant depending on the nomenclature (same test). Variable surf is not significant, so let's remove it and refit.
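As a side note, the coefficient p-values can be extracted programmatically instead of being read off the printed summary. A minimal sketch on the built-in mtcars data (not the course data):

```r
# Extract the p-value column from the coefficient table of a fitted lm
fit <- lm(mpg ~ wt + hp, data = mtcars)
pvals <- coef(summary(fit))[, "Pr(>|t|)"]
pvals  # named vector: (Intercept), wt, hp
```

The same pattern works for any `lm` fit, e.g. `coef(summary(model))[, "Pr(>|t|)"]`.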
model <- lm(points ~ attitude + stra + age, data = learning2014)
summary(model)
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 3.48077 0.56220 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
Now all explanatory variables are significant with a 0.10 significance cutoff.
The F-test is significant, some of the variables are therefore associated with the response variable (have non-zero coefficient).
From the summary above, we see that as attitude increases by one, exam points increase by about 3.48 when other variables are kept constant. For stra, the value is about 1.00. Age seems to have a smaller effect, with an additional year decreasing exam points by around 0.09 if other variables do not change.
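As a concrete illustration of the coefficients above, a hypothetical 22-year-old student with attitude 3.2 and stra 3.0 would be predicted to score roughly 23 points:

```r
# Predicted exam points for a hypothetical student, plugging the fitted
# coefficients from the summary above into the regression equation
10.89543 + 3.48077 * 3.2 + 1.00371 * 3.0 + (-0.08822) * 22  # ~23.1
```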
Multiple R-squared is 0.2182, meaning that the three explanatory variables account for approximately 22% of the variation in the exam points. This is quite low, so we should be careful when e.g. predicting exam points for new students with this model. The adjusted R-squared accounts for the number of explanatory variables and is here similar to, but slightly smaller than, the unadjusted R-squared.
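For intuition, R-squared can be computed by hand as 1 - RSS/TSS. A self-contained sketch on the built-in mtcars data (not the course data):

```r
# R-squared equals 1 - RSS/TSS (residual vs. total sum of squares)
fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- sum(resid(fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
all.equal(1 - rss / tss, summary(fit)$r.squared)  # TRUE
```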
Linear models carry assumptions that need to be verified. We use diagnostic plots for visual assessment.
# residuals vs fitted values
plot(model, which = 1)
There doesn’t seem to be a clear pattern in the residuals versus fitted values (perhaps visually somewhat of a funnel with decreasing variance, but the Breusch-Pagan test car::ncvTest(model) does not detect heteroskedasticity): the variance of the residuals is similar across the fitted values, and the variation is approximately symmetric around zero. The constant-variance assumption appears to be met, and a linear model appears to be reasonable for the data.
Three outliers (set by id.n in plot.lm) are detected (rows 35, 56 and 145), let’s check them.
print(learning2014[c(35, 56, 145), ])
## gender age attitude deep stra surf points
## 35 M 20 4.2 4.500000 3.250 1.583333 10
## 56 F 45 3.8 3.000000 3.125 3.250000 9
## 145 F 21 3.5 3.916667 3.875 3.500000 7
annotate <- c(35, 56, 145, 4, 2) # 4 and 2 for leverage below
# make the plot tighter
op <- par(mfrow=c(3, 1),
mar = c(4, 4, 1, 1),
mgp = c(2, 1, 0))
parameters <- c("attitude", "stra", "age")
for (param in parameters) {
print(param)
plot(learning2014[, param], learning2014$points,
xlab = param, ylab = "points")
text(learning2014[annotate, param],
learning2014[annotate, "points"],
annotate, pos = 4, col = "red")
}
## [1] "attitude"
## [1] "stra"
## [1] "age"
par(op) # return original par
These are students who got low exam points despite having a high attitude value.
Plot a boxplot to assess the symmetry of the residual distribution.
boxplot(resid(model))
It does not look too bad, but is not perfectly symmetric. A QQ plot will help in assessing normality.
# qq-plot
plot(model, which = 2)
Standardized residuals follow the linear pattern quite reasonably with the same outlier exceptions as above. Hence, there is no strong reason to suspect deviation from normality in the distribution of the residuals.
# residuals vs. leverage
plot(model, which = 5)
Observations 2, 4 and 56 are flagged; they have relatively high influence on the regression line compared to the other observations. We can rerun the model without them to see whether the multiple R-squared increases.
Cook's distances are also low, supporting the notion that no outlier has a large effect on the linear fit.
max(cooks.distance(model))
## [1] 0.1202162
summary(lm(points ~ attitude + stra + age, data = learning2014[-c(2, 4, 56), ]))
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014[-c(2,
## 4, 56), ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.822 -3.520 0.348 3.617 10.919
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.38472 2.58566 3.243 0.00144 **
## attitude 3.64594 0.53695 6.790 2.11e-10 ***
## stra 0.84219 0.51066 1.649 0.10108
## age 0.01968 0.05677 0.347 0.72930
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.008 on 159 degrees of freedom
## Multiple R-squared: 0.2419, Adjusted R-squared: 0.2276
## F-statistic: 16.91 on 3 and 159 DF, p-value: 1.394e-09
The multiple R-squared is indeed a bit higher. However, removal of the outliers would have to be justified from the data (e.g. are these students somehow different?).
library(car)
## Loading required package: carData
vif(model)
## attitude stra age
## 1.004083 1.014233 1.010865
The variance inflation factors are below the common textbook cutoff of 10, so there is no concern about collinearity.
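For reference, the VIF of a predictor is 1/(1 - R_j^2), where R_j^2 comes from regressing that predictor on the remaining ones. A sketch on the built-in mtcars data (not the course data):

```r
# VIF by hand: regress one predictor on the others, then 1 / (1 - R^2).
# This equals the VIF of wt in the model lm(mpg ~ wt + hp + disp).
fit.wt <- lm(wt ~ hp + disp, data = mtcars)
v <- 1 / (1 - summary(fit.wt)$r.squared)
v
```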
We can try automatic stepwise model selection by AIC criterion.
library(MASS)
model.full <- lm(points ~ gender + age + attitude + deep + stra + surf,
data = learning2014)
stepAIC(model.full, direction = "both")
## Start: AIC=558.36
## points ~ gender + age + attitude + deep + stra + surf
##
## Df Sum of Sq RSS AIC
## - gender 1 0.11 4408.5 556.36
## - surf 1 47.79 4456.1 558.15
## - deep 1 49.00 4457.3 558.19
## <none> 4408.3 558.36
## - stra 1 83.78 4492.1 559.48
## - age 1 87.88 4496.2 559.63
## - attitude 1 919.82 5328.2 587.82
##
## Step: AIC=556.36
## points ~ age + attitude + deep + stra + surf
##
## Df Sum of Sq RSS AIC
## - surf 1 47.69 4456.1 556.15
## - deep 1 49.11 4457.6 556.20
## <none> 4408.5 556.36
## - stra 1 88.35 4496.8 557.66
## - age 1 90.16 4498.6 557.72
## + gender 1 0.11 4408.3 558.36
## - attitude 1 999.18 5407.6 588.27
##
## Step: AIC=556.15
## points ~ age + attitude + deep + stra
##
## Df Sum of Sq RSS AIC
## - deep 1 26.61 4482.8 555.14
## <none> 4456.1 556.15
## + surf 1 47.69 4408.5 556.36
## - age 1 75.36 4531.5 556.93
## - stra 1 106.07 4562.2 558.05
## + gender 1 0.02 4456.1 558.15
## - attitude 1 1084.38 5540.5 590.30
##
## Step: AIC=555.14
## points ~ age + attitude + stra
##
## Df Sum of Sq RSS AIC
## <none> 4482.8 555.14
## - age 1 76.62 4559.4 555.95
## + deep 1 26.61 4456.1 556.15
## + surf 1 25.20 4457.6 556.20
## - stra 1 97.64 4580.4 556.71
## + gender 1 0.01 4482.8 557.14
## - attitude 1 1060.72 5543.5 588.39
##
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
##
## Coefficients:
## (Intercept) age attitude stra
## 10.89543 -0.08822 3.48077 1.00371
The resulting model is the same one we derived earlier.
Finally, we can brute-force all linear models over simple combinations of the explanatory variables.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:MASS':
##
## cement
## The following object is masked from 'package:datasets':
##
## rivers
res <- olsrr::ols_step_all_possible(model.full)
res[order(res$adjr, decreasing = TRUE), ]
## Index N Predictors R-Square Adj. R-Square
## 62 57 5 age attitude deep stra surf 0.2311318189 0.207104688
## 33 22 3 age attitude stra 0.2181719645 0.203693668
## 52 42 4 age attitude deep stra 0.2228135019 0.203504521
## 54 43 4 age attitude stra surf 0.2225665343 0.203251417
## 63 63 6 gender age attitude deep stra surf 0.2311510113 0.202137842
## 43 44 4 gender age attitude stra 0.2181729919 0.198748718
## 57 58 5 gender age attitude deep stra 0.2228168858 0.198529914
## 59 59 5 gender age attitude stra surf 0.2226046106 0.198311005
## 53 45 4 age attitude deep surf 0.2157221484 0.196236984
## 56 46 4 attitude deep stra surf 0.2154077956 0.195914822
## 17 7 2 attitude stra 0.2048096617 0.195052725
## 38 23 3 attitude deep stra 0.2096692889 0.195033535
## 34 24 3 age attitude surf 0.2081990315 0.193536051
## 40 25 3 attitude stra surf 0.2074263400 0.192749050
## 58 60 5 gender age attitude deep surf 0.2165385715 0.192055402
## 12 8 2 age attitude 0.2011434849 0.191341564
## 61 61 5 gender attitude deep stra surf 0.2158241415 0.191318646
## 27 26 3 gender attitude stra 0.2050965812 0.190376147
## 48 47 4 gender attitude deep stra 0.2098633360 0.190232611
## 32 27 3 age attitude deep 0.2043132366 0.189578297
## 44 48 4 gender age attitude surf 0.2090640320 0.189413449
## 50 49 4 gender attitude stra surf 0.2079025033 0.188223062
## 39 28 3 attitude deep surf 0.2023788945 0.187608133
## 22 29 3 gender age attitude 0.2017804149 0.186998571
## 3 1 1 attitude 0.1905536672 0.185618019
## 18 9 2 attitude surf 0.1952865878 0.185412804
## 42 50 4 gender age attitude deep 0.2048827672 0.185128302
## 49 51 4 gender attitude deep surf 0.2040700385 0.184295381
## 16 10 2 attitude deep 0.1939911024 0.184101423
## 28 30 3 gender attitude surf 0.1970278452 0.182157990
## 8 11 2 gender attitude 0.1919357660 0.182020867
## 26 31 3 gender attitude deep 0.1952588802 0.180356267
## 47 52 4 gender age stra surf 0.0653293394 0.042107708
## 60 62 5 gender age deep stra surf 0.0707276010 0.041687839
## 37 32 3 age stra surf 0.0520482669 0.034493605
## 55 53 4 age deep stra surf 0.0568671481 0.033435276
## 24 33 3 gender age stra 0.0504456777 0.032861338
## 31 34 3 gender stra surf 0.0462022812 0.028539360
## 45 54 4 gender age deep stra 0.0514840047 0.027918390
## 51 55 4 gender deep stra surf 0.0510011757 0.027423565
## 21 12 2 stra surf 0.0363412029 0.024517169
## 25 35 3 gender age surf 0.0420448279 0.024304917
## 41 36 3 deep stra surf 0.0407134651 0.022948900
## 10 13 2 gender stra 0.0346692266 0.022824677
## 46 56 4 gender age deep surf 0.0462679619 0.022572756
## 15 14 2 age surf 0.0340077514 0.022155086
## 14 15 2 age stra 0.0331743748 0.021311484
## 36 37 3 age deep surf 0.0379369858 0.020121004
## 29 38 3 gender deep stra 0.0357520229 0.017895579
## 35 39 3 age deep stra 0.0336895603 0.015794923
## 5 2 1 stra 0.0213517768 0.015384410
## 6 3 1 surf 0.0208387755 0.014868280
## 11 16 2 gender surf 0.0267878521 0.014846599
## 30 40 3 gender deep surf 0.0306219089 0.012670463
## 20 17 2 deep surf 0.0244545065 0.012484623
## 19 18 2 deep stra 0.0219453518 0.009944681
## 7 19 2 gender age 0.0196556536 0.007626889
## 2 4 1 age 0.0086844365 0.002639829
## 1 5 1 gender 0.0086318623 0.002586935
## 23 41 3 gender age deep 0.0198419947 0.001690921
## 9 20 2 gender deep 0.0088743608 -0.003286690
## 13 21 2 age deep 0.0087454938 -0.003417138
## 4 6 1 deep 0.0001029919 -0.005993941
## Mallow's Cp
## 62 5.003969
## 33 3.684101
## 52 4.724219
## 54 4.775293
## 63 7.000000
## 43 5.683889
## 57 6.723519
## 59 6.767418
## 53 6.190730
## 56 6.255739
## 17 4.447461
## 38 5.442477
## 34 5.746530
## 40 5.906325
## 58 8.021891
## 12 5.205636
## 61 8.169637
## 27 6.388125
## 48 7.402347
## 32 6.550123
## 44 7.567646
## 50 7.807853
## 39 6.950150
## 22 7.073917
## 3 5.395638
## 18 6.416857
## 42 8.432342
## 49 8.600417
## 16 6.684767
## 28 8.056761
## 8 7.109816
## 26 8.422587
## 47 37.292359
## 60 38.175985
## 37 38.038920
## 55 39.042363
## 24 38.370340
## 31 39.247886
## 45 40.155611
## 51 40.255461
## 21 39.287183
## 25 40.107658
## 41 40.382987
## 10 39.632952
## 46 41.234303
## 15 39.769746
## 14 39.942091
## 36 40.957170
## 29 41.409026
## 35 41.835549
## 5 40.387035
## 6 40.493125
## 11 41.262841
## 30 42.469948
## 20 41.745383
## 19 42.264283
## 7 42.737798
## 2 43.006675
## 1 43.017547
## 23 44.699262
## 9 44.967398
## 13 44.994048
## 4 44.781340
By adjusted R-squared, the three-variable model we used before fares very well compared to the full model. Attitude alone is also quite good in comparison and could be chosen for parsimony.
date()
## [1] "Sat Nov 20 12:56:50 2021"
This week, we are analyzing and modeling data on two questionnaires related to student performance and alcohol consumption. Further description is available here. Since the description is relevant for analysis we reprint it verbatim:
“This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).”
The dataset has been preprocessed by joining data from two courses, Math and Portuguese language. Variables alc_use and high_use have been added in the wrangling phase.
Begin by loading the analysis data. Let’s use the “official” csv-file; an R script to produce the wrangled data is in the repo.
rm(list=ls())
# read in the data, convert strings to factors
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv",
header = TRUE, sep = ",", stringsAsFactors = TRUE)
dim(alc)
## [1] 370 51
There are 370 rows and 51 columns.
str(alc)
## 'data.frame': 370 obs. of 51 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 1 1 1 1 1 1 1 1 1 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 1 1 1 1 1 2 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 2 2 2 2 2 2 2 2 2 1 ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 3 1 4 4 4 4 2 3 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 3 3 3 2 4 2 5 4 3 2 ...
## $ reason : Factor w/ 4 levels "course","home",..: 2 4 4 1 4 1 1 4 4 4 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 2 2 2 3 2 1 2 1 1 ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 1 2 1 2 ...
## $ famsup : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 1 1 1 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 1 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 2 1 1 2 1 2 1 1 1 ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : Factor w/ 2 levels "no","yes": 2 1 1 1 2 1 1 1 1 1 ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 1 2 1 1 ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
There are mostly integer and factor (converted from character) variables, as well as at least one logical (high_use) and one double (alc_use).
# (Type for factor is int.)
table(sapply(alc, typeof))
##
## double integer logical
## 1 49 1
# Check for zero variance (negate characters to get "not all duplicated")
sort(
sapply(alc, function(x) {
ifelse(is.numeric(x), var(x), !all(duplicated(x)[-1L]))
}))
## n failures.p failures traveltime failures.m studytime
## 0.000000e+00 2.398667e-01 3.109939e-01 4.916502e-01 5.049513e-01 7.189922e-01
## Dalc famrel freetime alc_use school sex
## 8.032594e-01 8.304695e-01 9.712224e-01 9.958178e-01 1.000000e+00 1.000000e+00
## address famsize Pstatus Mjob Fjob reason
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## guardian schoolsup famsup activities nursery higher
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## internet romantic paid paid.p paid.m high_use
## 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00 1.000000e+00
## Medu Fedu goout age Walc health
## 1.173984e+00 1.179697e+00 1.273720e+00 1.393987e+00 1.666366e+00 1.981220e+00
## G2.p G1.p G1 G2 G3.p G3
## 6.078518e+00 6.512854e+00 7.301699e+00 7.914627e+00 8.665092e+00 1.080868e+01
## G1.m G2.m G3.m absences.p absences absences.m
## 1.119153e+01 1.444749e+01 2.124131e+01 2.330626e+01 3.029392e+01 5.876224e+01
## cid id.m id.p
## 1.143917e+04 1.274368e+04 3.041907e+04
# Seems OK, only `n` has zero variance.
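The duplicated() trick above deserves a note: in a constant vector every element after the first is a duplicate, so `!all(duplicated(x)[-1L])` is FALSE exactly for zero-variance character or factor columns. A tiny self-contained check:

```r
# Constant vector: everything after the first element is duplicated
!all(duplicated(c("a", "a", "a"))[-1L])  # FALSE -> "zero variance"
# Non-constant vector: at least one non-duplicate remains
!all(duplicated(c("a", "b", "a"))[-1L])  # TRUE
```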
Begin with GGally::ggpairs on a subset of interesting variables; the same variable set, interesting_vars, is used again in the modeling below.
library(GGally)
library(ggplot2)
# Variables of interest: the response high_use plus candidate predictors
interesting_vars <- c("high_use", "G3", "famrel", "goout", "sex", "studytime")
alc.sub <- alc[, c(interesting_vars[-1], "alc_use")]
p <- ggpairs(alc.sub, mapping = aes(col = sex, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)),
upper = list(continuous = wrap("cor", size = 3)))
p
We see that

- G3 is similar for both sexes, is not normal (by shapiro.test()), has a peak and is left skewed
- famrel correlates with alc_use, but significantly only for males
- goout correlates positively with alcohol use, with high significance
- studytime correlates negatively with alcohol use, with high significance

All in all, there are several hypotheses.
Let’s take a closer look at the quality of family relationships versus alcohol use.
library(tidyverse)
alc_var <- function(alc, expl.variable) {
expl.vars <- c(expl.variable, "alc_use", "sex")
# Tidyverse black magic
res <-
alc %>%
select(one_of(expl.vars)) %>%
count(!!!sapply(expl.vars, as.symbol))
g <- ggplot(res, aes(x = .data[[expl.variable]], y = alc_use, col = sex)) +
geom_point(aes(size = n), alpha = 0.5) + geom_smooth(method = "loess")
return(g)
}
alc_var(alc, "famrel")
With a bit of caution, it appears that men with low-quality famrel use more alcohol, as alc_use increases when famrel goes from medium to bad. However, the relationship is roughly linear (without deeper analysis):
summary(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
##
## Call:
## lm(formula = alc_use ~ famrel, data = alc[alc$sex == "M", ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7458 -0.9118 -0.1898 0.8102 3.0882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.30184 0.38377 8.604 4.55e-15 ***
## famrel -0.27800 0.09368 -2.968 0.00343 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.128 on 173 degrees of freedom
## Multiple R-squared: 0.04844, Adjusted R-squared: 0.04294
## F-statistic: 8.807 on 1 and 173 DF, p-value: 0.003427
# make the plot tighter
op <- par(mfrow=c(2, 2),
mar = c(4, 4, 2, 1),
mgp = c(2, 1, 0))
plot(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
par(op) # return original par
library(car)
# There is no heteroscedasticity by Breusch-Pagan test
car::ncvTest(lm(alc_use ~ famrel, data = alc[alc$sex == "M", ]))
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.398195, Df = 1, p = 0.52802
Next, plot the other variables as well
library(ggpubr)
g1 <- alc_var(alc, "goout")
g2 <- alc_var(alc, "sex")
g3 <- alc_var(alc, "studytime")
g4 <- alc_var(alc, "G3")
ggarrange(g1, g2, g3, g4,
ncol = 4, nrow = 1, common.legend = TRUE,
legend = "bottom")
When goout increases, so does alcohol use. Males appear to use more alcohol. As studytime increases, alcohol use drops. G3 does not show a clear linear pattern with alcohol use. Otherwise, the hypotheses above hold.
Now we are ready to fit the logistic regression model with G3, famrel, goout, sex and studytime.
model.1 <- glm(paste0("high_use", " ~ ", paste(interesting_vars[-1], collapse=" + ")),
family = "binomial", data = alc)
summary(model.1)
##
## Call:
## glm(formula = paste0("high_use", " ~ ", paste(interesting_vars[-1],
## collapse = " + ")), family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6895 -0.7784 -0.4891 0.8155 2.6807
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.85026 0.85893 -0.990 0.32222
## G3 -0.03667 0.04031 -0.910 0.36294
## famrel -0.41604 0.13989 -2.974 0.00294 **
## goout 0.77038 0.12402 6.212 5.25e-10 ***
## sexM 0.80171 0.26695 3.003 0.00267 **
## studytime -0.46034 0.17229 -2.672 0.00754 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.86 on 364 degrees of freedom
## AIC: 381.86
##
## Number of Fisher Scoring iterations: 4
We got a model with four significant variables. G3 is not significant, so remove it and refit.
model.2 <- glm(high_use ~ famrel + goout + sex + studytime,
family = "binomial", data = alc)
summary(model.2)
##
## Call:
## glm(formula = high_use ~ famrel + goout + sex + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5891 -0.7629 -0.4976 0.8304 2.6937
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2672 0.7312 -1.733 0.08309 .
## famrel -0.4193 0.1399 -2.996 0.00273 **
## goout 0.7873 0.1230 6.399 1.57e-10 ***
## sexM 0.7959 0.2669 2.982 0.00286 **
## studytime -0.4814 0.1711 -2.814 0.00490 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 370.69 on 365 degrees of freedom
## AIC: 380.69
##
## Number of Fisher Scoring iterations: 4
There is a lot of output. In particular, the null deviance is the deviance of the intercept-only model (high_use ~ 1), and the residual deviance is the deviance of the fitted model. The residual deviance can be verified by hand:
# Residual deviance
sum(residuals(model.2, type = "deviance")^2)
## [1] 370.6854
Check the residual plot for obvious structure; a strong pattern would cast doubt on the GLM fit.
# Plot residuals and check that linearity is OK
plot(model.2, which = 1) # -> Looks OK.
We can also test with ANOVA if null model is significantly different from our model
# see http://homepage.stat.uiowa.edu/~rdecook/stat3200/notes/LogReg_part2_4pp.pdf
# and https://cran.r-project.org/web/packages/HSAUR/vignettes/Ch_logistic_regression_glm.pdf
# and https://stats.stackexchange.com/questions/59879/
# 1-pchisq(model.2$null.deviance-model.2$deviance, model.2$df.null-model.2$df.residual)
null <- glm(high_use ~ 1, family = "binomial", data = alc)
anova(null, model.2, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: high_use ~ 1
## Model 2: high_use ~ famrel + goout + sex + studytime
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 369 452.04
## 2 365 370.69 4 81.354 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We conclude that at least one of the explanatory variables has non-zero coefficient by this test.
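The same chi-squared (likelihood-ratio) test can be computed by hand from the deviances reported above:

```r
# Deviance test by hand, using the values from the ANOVA table above
lr <- 452.04 - 370.69  # null deviance - residual deviance
df <- 369 - 365        # difference in residual degrees of freedom
p <- pchisq(lr, df, lower.tail = FALSE)
p  # ~9e-17, consistent with the reported < 2.2e-16
```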
OR <- coef(model.2) %>% exp
CI <- confint(model.2) %>% exp
list(OR = OR, CI = CI)
## $OR
## (Intercept) famrel goout sexM studytime
## 0.2816141 0.6575181 2.1974324 2.2165443 0.6179355
##
## $CI
## 2.5 % 97.5 %
## (Intercept) 0.06545486 1.1622100
## famrel 0.49791884 0.8636137
## goout 1.73873662 2.8198312
## sexM 1.31841735 3.7630180
## studytime 0.43751933 0.8576353
The odds ratio for famrel is 0.66 with a 95% CI of [0.50, 0.86]. Thus, a one-unit increase in famrel is associated with a 14-50% decrease in the odds of high_use.
Variables goout (increases the odds of high_use) and studytime (decreases them) are interpreted analogously, and neither 95% CI contains 1.
sexM is interpreted relative to the implicit reference level sexF: being male increases the odds of high_use by around 122%, with a 95% CI of [1.32, 3.76].
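The conversion from a log-odds coefficient to an odds ratio and a percentage change, using the sexM coefficient above, is simply:

```r
# Log-odds coefficient -> odds ratio -> percent change in odds
beta <- 0.7959   # sexM coefficient from summary(model.2) above
or <- exp(beta)  # ~2.22, matches the OR table
(or - 1) * 100   # ~122% increase in the odds of high_use
```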
It seems that the results correspond to the hypotheses. It feels like famrel would require more work although the interaction term of sex*famrel is not significant when added to model.2 (data not shown).
observed <- alc$high_use
predicted <- predict(model.2, type = "response") > 0.5
# 2x2 tabulation, confusion matrix
table(observed, predicted)
## predicted
## observed FALSE TRUE
## FALSE 230 29
## TRUE 59 52
Most cases are classified correctly.
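The confusion matrix above can be summarized into accuracy and training error:

```r
# Accuracy and training error from the 2x2 table above
# (matrix() fills column-wise: 230 and 59 form the "predicted FALSE" column)
cm <- matrix(c(230, 59, 29, 52), nrow = 2,
             dimnames = list(observed = c("FALSE", "TRUE"),
                             predicted = c("FALSE", "TRUE")))
acc <- sum(diag(cm)) / sum(cm)  # ~0.762
1 - acc                         # training error ~0.238
```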
Below are the predictions versus actual high_use values, plotted separately for each explanatory variable.
# Get predictions, compare against true values
predicted_abs <- predict(model.2, type = "response")
df <- data.frame(cbind(observed, predicted, predicted_abs,
alc[, interesting_vars[3:6]]))
df$matches <- df$observed == df$predicted
# Plot by explanatory variables
prob.expl <- function(df, ind.var, expl.var, col.var) {
g <- ggplot(df, aes(x = .data[[ind.var]],
y = .data[[expl.var]])) +
geom_point(alpha = 0.5, aes(col = .data[[col.var]])) +
geom_smooth(method = "loess", col = "black", size = 0.5) +
theme_bw()
g
}
g1 <- prob.expl(df, "predicted_abs", "famrel", "matches")
g2 <- prob.expl(df, "predicted_abs", "goout", "matches")
g3 <- prob.expl(df, "predicted_abs", "sex", "matches")
g4 <- prob.expl(df, "predicted_abs", "studytime", "matches")
ggarrange(g1, g2, g3, g4, ncol = 2, nrow = 2, labels = c("A", "B", "C", "D"),
common.legend = TRUE, legend = "bottom")
From the plot above we see where the predictions of high_use are right (greenish points) and wrong (red points). For famrel (A), sex (C) and studytime (D), the false predictions cluster around probability 0.5. With goout (B), high_use appears to be poorly predicted when goout is low and the predicted value is in the range [0.5, 0.7].
Quantify at which probability values mismatches tend to occur:
# Predictions equal observation
round(table(cut_interval(df[df$matches == TRUE, c("predicted_abs")], length=0.20)) /
length(df$matches), 2)
##
## [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8] (0.8,1]
## 0.36 0.22 0.08 0.08 0.02
# False predictions
round(table(cut_interval(df[df$matches == FALSE, c("predicted_abs")], length=0.20)) /
length(df$matches), 2)
##
## [0,0.2] (0.2,0.4] (0.4,0.6] (0.6,0.8]
## 0.06 0.05 0.10 0.03
This confirms that most false predictions are made around the probability 0.5.
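The same binning can also be done without ggplot2's cut_interval, using base R's cut(). A self-contained sketch on made-up probabilities (the variable names here are illustrative, not from the data):

```r
set.seed(1)
prob <- runif(20) # hypothetical predicted probabilities
# Bin into five equal-width intervals, like cut_interval(length = 0.20)
bins <- cut(prob, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
round(prop.table(table(bins)), 2) # share of cases per probability bin
```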
The training error of the model is 0.24, i.e. fairly low. If we simply guessed high_use from goout alone (TRUE whenever goout is above its mean), the error would be around 0.28, which is not much worse. Of course this guess is not random: we already know that goout is an important predictor.
# Training error, model.2
sum(df$matches == FALSE) / length(df$matches)
## [1] 0.2378378
# Guess high_use by goout
guesses <- rep(FALSE, times = nrow(df))
guesses[df$goout > mean(df$goout)] <- TRUE
table(df$observed, guesses)
## guesses
## FALSE TRUE
## FALSE 198 61
## TRUE 41 70
sum(guesses != df$observed) / nrow(df)
## [1] 0.2756757
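Another useful baseline is the majority-class guess, i.e. always predicting FALSE. With 111 high-use cases out of 370 (the counts in the confusion matrix above), this gives:

```r
# Majority-class baseline: predict FALSE for everyone,
# so every TRUE case is misclassified
n_true <- 111; n_total <- 370
n_true / n_total
## [1] 0.3
```

So the goout-based guess (0.28) barely improves on always guessing FALSE, while model.2 (0.24) clearly does.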
For the cross-validation, follow the code from DataCamp:
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, df$predicted_abs)
## [1] 0.2378378
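The loss function works because R coerces the logical class to 0/1, so abs(class - prob) > 0.5 flags exactly the misclassified cases. A small self-contained check:

```r
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5 # TRUE coerces to 1, FALSE to 0
  mean(n_wrong)
}
# right, wrong, wrong, right -> 2/4 misclassified
loss_func(c(TRUE, FALSE, TRUE, FALSE), c(0.9, 0.8, 0.4, 0.1))
## [1] 0.5
```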
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = model.2, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2486486
The cross-validated error is slightly better than the DataCamp baseline, about 0.25 versus 0.26.
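Since the 10-fold split is random, the estimate varies a little between runs; repeating the cross-validation and averaging gives a more stable figure. A sketch reusing alc, model.2 and loss_func from above (not run here):

```r
library(boot)
set.seed(42)
# Repeat 10-fold CV several times and summarise the error estimates
cv_errors <- replicate(20, cv.glm(data = alc, cost = loss_func,
                                  glmfit = model.2, K = 10)$delta[1])
c(mean = mean(cv_errors), sd = sd(cv_errors))
```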
# Select all explanatory variables available
training.vars <- colnames(alc)[1:36] # Discard non-joined .m and .p variables
# Discard alcohol consumption variables and technical variables
training.vars <- training.vars[!(
training.vars %in% c("Dalc", "Walc", "n", "id.p", "id.m"))]
Now, with enough compute we could simply try all subsets of the remaining variables, but that would amount to about 2^32 = 4294967296 models. Instead, we use stepwise selection by AIC, starting from the full model and progressing backwards, and let R handle the dummy (treatment) coding of the factor variables. (Note the well-known problems of stepwise regression; other selection methods, e.g. via the caret package, would be advisable.)
library(MASS)
# Run stepwise AIC selection on full model, keep the model and its AIC
final.m <- stepAIC(
# Full model
glm(paste0("high_use", " ~ ", paste(training.vars[-1], collapse=" + ")),
family = binomial, data = alc),
direction = "backward", trace = FALSE,
# What to keep from models
keep = function(model, aic) { list(model = model, aic = aic) } )
Then perform 10-fold CV for the models
# Init result vectors, one entry per model kept by stepAIC
n.models <- ncol(final.m$keep)
CVs <- numeric(n.models) # Test error
ERRs <- numeric(n.models) # Train error
AICs <- numeric(n.models)
Nvar <- numeric(n.models)
for (i in 1:ncol(final.m$keep)) { # Each column is a model
interim.m <- final.m$keep[, i][1]$model
CVs[i] <- cv.glm(data = alc, cost = loss_func, glmfit = interim.m, K = 10)$delta[1]
ERRs[i] <- loss_func(alc$high_use, predict(interim.m, type = "response"))
AICs[i] <- final.m$keep[, i][2]$aic
Nvar[i] <- length(final.m$keep[, i][1]$model$coefficients) - 1
}
CV.res <- data.frame(test = CVs, train = ERRs, AIC = AICs, Nvar = Nvar)
# Somewhat unsatisfactory graph with points
g1 <- ggplot(CV.res, aes(x = Nvar, y = test)) + geom_line() +
geom_point(aes(x = Nvar, y = test, color = AIC), size = 5, alpha = 0.8) +
theme_bw() + xlab("Number of variables") + ylab("Test set error") +
ggtitle("10-fold CV on backward stepwise variable selection with AIC") +
scale_color_gradient(low="blue", high="red")
g1
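Besides eyeballing the curves, the model with the smallest cross-validated error can be picked out of CV.res directly, e.g. (a sketch reusing the objects built above):

```r
# Which kept model minimises the 10-fold CV error?
best <- which.min(CV.res$test)
CV.res[best, ] # its test/train error, AIC and size
formula(final.m$keep[, best][1]$model) # and its formula
```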
# Better graph with both test and training error, and AIC separately
errors <- gather(CV.res, type, error, c("test", "train"))
g1 <- ggplot(errors, aes(x = Nvar, color = type)) + geom_line(aes(y = error)) +
theme_bw() + xlab("Number of variables") + ylab("Prediction error")
g2 <- ggplot(errors, aes(x = Nvar)) + geom_line(aes(y = AIC)) +
theme_bw() + xlab("Number of variables") + ylab("AIC")
g <- ggarrange(g1, g2, ncol = 2, nrow = 1, widths = c(2, 1.5),
common.legend = FALSE, legend = "left")
annotate_figure(g, top = text_grob("10-fold CV on backward stepwise variable selection with AIC"))
The final model was:
summary(final.m)
##
## Call:
## glm(formula = high_use ~ sex + address + famsize + reason + guardian +
## activities + romantic + famrel + goout + health + absences,
## family = binomial, data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8167 -0.7154 -0.4262 0.5635 2.7612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.14937 0.86318 -2.490 0.012772 *
## sexM 0.98275 0.28722 3.422 0.000623 ***
## addressU -0.89491 0.33661 -2.659 0.007847 **
## famsizeLE3 0.47009 0.29971 1.569 0.116759
## reasonhome 0.31183 0.35612 0.876 0.381229
## reasonother 1.24517 0.47895 2.600 0.009328 **
## reasonreputation -0.28654 0.36969 -0.775 0.438278
## guardianmother -0.68303 0.32428 -2.106 0.035179 *
## guardianother 0.41726 0.71637 0.582 0.560256
## activitiesyes -0.51481 0.28383 -1.814 0.069710 .
## romanticyes -0.50684 0.30464 -1.664 0.096171 .
## famrel -0.50627 0.15580 -3.249 0.001156 **
## goout 0.91148 0.13829 6.591 4.36e-11 ***
## health 0.16738 0.10359 1.616 0.106138
## absences 0.09152 0.02373 3.856 0.000115 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 335.02 on 355 degrees of freedom
## AIC: 365.02
##
## Number of Fisher Scoring iterations: 5
Removing all non-significant explanatory variables and iterating yields a model with a 0.20 training error and a 0.22 test error in 10-fold CV:
parsimonious.m <- glm(formula = high_use ~ sex + address + reason + guardian +
activities + famrel + goout + absences, family = binomial, data = alc)
summary(parsimonious.m)
##
## Call:
## glm(formula = high_use ~ sex + address + reason + guardian +
## activities + famrel + goout + absences, family = binomial,
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8386 -0.7558 -0.4676 0.6257 2.8133
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.65723 0.77828 -2.129 0.033225 *
## sexM 1.09542 0.27999 3.912 9.14e-05 ***
## addressU -0.84765 0.33097 -2.561 0.010435 *
## reasonhome 0.24841 0.34895 0.712 0.476533
## reasonother 1.06110 0.46814 2.267 0.023412 *
## reasonreputation -0.35044 0.36541 -0.959 0.337531
## guardianmother -0.65433 0.31871 -2.053 0.040071 *
## guardianother 0.27975 0.69303 0.404 0.686459
## activitiesyes -0.55234 0.28067 -1.968 0.049075 *
## famrel -0.45739 0.15011 -3.047 0.002311 **
## goout 0.87831 0.13550 6.482 9.06e-11 ***
## absences 0.08739 0.02335 3.743 0.000182 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 341.90 on 358 degrees of freedom
## AIC: 365.9
##
## Number of Fisher Scoring iterations: 5
loss_func(alc$high_use, predict(parsimonious.m, type = "response"))
## [1] 0.1972973
cv.glm(data = alc, cost = loss_func, glmfit = parsimonious.m, K = 10)$delta[1]
## [1] 0.2216216
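To wrap up, the error estimates from this chapter side by side (numbers copied from the chunks above; the goout guess has no cross-validated figure):

```r
# Training and 10-fold CV errors collected from the results above
comparison <- data.frame(
  model = c("guess by goout", "model.2", "parsimonious.m"),
  train_error = c(0.276, 0.238, 0.197),
  cv_error = c(NA, 0.249, 0.222)
)
comparison
```

The parsimonious stepwise model comes out best on both measures, though the margins are modest.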
(more chapters to be added similarly as we proceed with the course!)